{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Notebook 4: RAG Human-Like Evaluation - LLM-as-a-Judge\n",
    "\n",
    "In this notebook, we are going to use a high quality LLM to generate evaluation scores (human-like) for RAG system final outputs.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use Llama2 70B model to evaluate the example RAG pipeline.\n",
    "The score granulaity is from 1 to 5 where:\n",
    "\n",
    "- **Score 1**: Answer irrelevant or invalid, does not follow the context of the question or is irrelevant\n",
    "- **Score 2**: Answer barely useable, missing significant accurate information  \n",
    "- **Score 3**: Answer mostly helpful, missing some information or added erroneous information\n",
    "- **Score 4**: Answer helpful, room for some improvement, could be more concise\n",
    "- **Score 5**: Answer helpful, accurate, relevant and concise\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1: Load the Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first load the JSON dataset. The structure should be: \n",
    "\n",
    "```\n",
    "{\n",
    "'gt_context': chunk,\n",
    "'document': filename,\n",
    "'question': \"xxxxx\",\n",
    "'gt_answer': \"xxx xxx xxxx\",\n",
    "'contexts': \"xxx xxx xxxx\",\n",
    "'answer':\"xxx xxx xxxx\",\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# The path to your JSON file\n",
    "file_path = 'eval.json'\n",
    "\n",
    "# Read the JSON file\n",
    "with open(file_path, 'r', encoding='utf-8') as file:\n",
    "    data = json.load(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check some of the loaded data\n",
    "print(data[:1])\n",
    "print(\"Number of entries\", len(data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set your API key for Nvidia AI Playground"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "invoke_url = \"https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/0e349b44-440a-44e1-93e9-abe8dcb27158\" #Llama 2 70B\n",
    "fetch_url_format = \"https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/\"\n",
    "\n",
    "# do not remove Bearer from Authorization, replace <REPLACE_THIS_WITH_API_KEY> with api key\n",
    "headers = {\n",
    "    \"Authorization\": \"Bearer <REPLACE_THIS_WITH_API_KEY>\",\n",
    "    \"Accept\": \"application/json\",\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Design the LLM-as-a-Judge Prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The evaluation axes are the helpfulness, relevance, accuracy, and level of detail. Prompting the high quality LLM to generate human-like evaluation requires a careful prompt engineering with an explicit instructions\n",
    "\n",
    "We must provide the evaluation criteria and the methodology in the same fashion as if we were giving human instructions on how to evaluate.\n",
    "We also ask the LLM to consider both the reference answer and context (ground truth) when evaluating the response provided by the RAG pipeline.\n",
    "Finally, we ask the LLM to provide a score on a scale of 1-5 (likert scale) and ask it to provide an explanation.\n",
    "\n",
    "Here is an example of judge_template that we will use with Llama2 70B. Notice the evaluation examples provided in the prompt. This will help guide the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "LLAMA_PROMPT_TEMPLATE = (\n",
    " \"<s>[INST] <<SYS>>\"\n",
    " \"{system_prompt}\"\n",
    " \"<</SYS>>\"\n",
    " \"\"\n",
    " \"Example 1:\"\n",
    " \"[Question]\"\n",
    " \"When did Queen Elizabeth II die?\"\n",
    " \"[The Start of the Reference Context]\"\n",
    " \"\"\"On 8 September 2022, Buckingham Palace released a statement which read: \"Following further evaluation this morning, the Queen's doctors are concerned for Her Majesty's health and have recommended she remain under medical supervision. The Queen remains comfortable and at Balmoral.\"[257][258] Her immediate family rushed to Balmoral to be by her side.[259][260] She died peacefully at 15:10 BST at the age of 96, with two of her children, Charles and Anne, by her side;[261][262] Charles immediately succeeded as monarch. Her death was announced to the public at 18:30,[263][264] setting in motion Operation London Bridge and, because she died in Scotland, Operation Unicorn.[265][266] Elizabeth was the first monarch to die in Scotland since James V in 1542.[267] Her death certificate recorded her cause of death as old age\"\"\"\n",
    " \"[The End of Reference Context]\"\n",
    " \"[The Start of the Reference Answer]\"\n",
    " \"Queen Elizabeth II died on September 8, 2022.\"\n",
    " \"[The End of Reference Answer]\"\n",
    " \"[The Start of the Assistant's Answer]\"\n",
    " \"She died on September 8, 2022\"\n",
    " \"[The End of Assistant's Answer]\"\n",
    " '\"Rating\": 5, \"Explanation\": \"The answer is helpful, relevant, accurate, and concise. It matches the information provided in the reference context and answer.\"'\n",
    " \"\"\n",
    " \"Example 2:\"\n",
    " \"[Question]\"\n",
    " \"When did Queen Elizabeth II die?\"\n",
    " \"[The Start of the Reference Context]\"\n",
    " \"\"\"On 8 September 2022, Buckingham Palace released a statement which read: \"Following further evaluation this morning, the Queen's doctors are concerned for Her Majesty's health and have recommended she remain under medical supervision. The Queen remains comfortable and at Balmoral.\"[257][258] Her immediate family rushed to Balmoral to be by her side.[259][260] She died peacefully at 15:10 BST at the age of 96, with two of her children, Charles and Anne, by her side;[261][262] Charles immediately succeeded as monarch. Her death was announced to the public at 18:30,[263][264] setting in motion Operation London Bridge and, because she died in Scotland, Operation Unicorn.[265][266] Elizabeth was the first monarch to die in Scotland since James V in 1542.[267] Her death certificate recorded her cause of death as old age\"\"\"\n",
    " \"[The End of Reference Context]\"\n",
    " \"[The Start of the Reference Answer]\"\n",
    " \"Queen Elizabeth II died on September 8, 2022.\"\n",
    " \"[The End of Reference Answer]\"\n",
    " \"[The Start of the Assistant's Answer]\"\n",
    " \"Queen Elizabeth II was the longest reigning monarch of the United Kingdom and the Commonwealth.\"\n",
    " \"[The End of Assistant's Answer]\"\n",
    " '\"Rating\": 1, \"Explanation\": \"The answer is not helpful or relevant. It does not answer the question and instead goes off topic.\"'\n",
    "  \"\"\n",
    " \"Follow the exact same format as above. Put Rating first and Explanation second. Rating must be between 1 and 5. What is the rating and explanation for the following assistant's answer\"\n",
    " \"[Question]\"\n",
    " \"{question}\"\n",
    " \"[The Start of the Reference Context]\"\n",
    " \"{ctx_ref}\"\n",
    " \"[The End of Reference Context]\"\n",
    " \"[The Start of the Reference Answer]\"\n",
    " \"{answer_ref}\"\n",
    " \"[The End of Reference Answer]\"\n",
    " \"[The Start of the Assistant's Answer]\"\n",
    " \"{answer}\"\n",
    " \"[The End of Assistant's Answer][/INST]\"\n",
    ")\n",
    "\n",
    "system_prompt = \"\"\"\n",
    "You are an impartial judge that evaluates the quality of an assistant's answer to the question provided.\n",
    "You evaluation takes into account helpfullness, relevancy, accuracy, and level of detail of the answer.\n",
    "You must use both the reference context and reference answer to guide your evaluation.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now call the Judge LLM on the RAG results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# re-use connections\n",
    "session = requests.Session()\n",
    "\n",
    "llama_judge_responses = []\n",
    "for d in data:\n",
    "    try:\n",
    "        prompt = LLAMA_PROMPT_TEMPLATE.format(system_prompt=system_prompt, question=d[\"question\"], ctx_ref=d[\"gt_context\"], answer_ref=d[\"gt_answer\"], answer=d[\"answer\"])\n",
    "        payload = {\n",
    "            \"messages\": [\n",
    "                {\n",
    "                \"content\": prompt,\n",
    "                \"role\": \"user\"\n",
    "                }\n",
    "            ],\n",
    "            \"temperature\": 0.1,\n",
    "            \"top_p\": 1.0,\n",
    "            \"max_tokens\": 200,\n",
    "            \"stream\": False\n",
    "            }\n",
    "\n",
    "        response = session.post(invoke_url, headers=headers, json=payload)\n",
    "\n",
    "        while response.status_code == 202:\n",
    "            request_id = response.headers.get(\"NVCF-REQID\")\n",
    "            fetch_url = fetch_url_format + request_id\n",
    "            response = session.get(fetch_url, headers=headers)\n",
    "\n",
    "        response_body = response.json()\n",
    "        llama_judge_responses.append(response_body['choices'][0]['message']['content'])\n",
    "        print(f\"progress: {len(llama_judge_responses)}/{len(data)}\", end='\\r')\n",
    "    except Exception as e:\n",
    "        print(\"Exception:\", e)\n",
    "        llama_judge_responses.append(None)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parse the rating and evaluations out of the Judge responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import re\n",
    "import statistics\n",
    "\n",
    "# Regular expression pattern to extract rating and explanation\n",
    "rating_pattern = r'Rating:\\s*(\\d+)'\n",
    "explanation_pattern = r'Explanation:\\s*(.+)'\n",
    "\n",
    "llama_ratings = []\n",
    "llama_explanations = []\n",
    "for response in llama_judge_responses:\n",
    "        try:\n",
    "                # Search for the patterns\n",
    "                rating_match = re.search(rating_pattern, response)\n",
    "                explanation_match = re.search(explanation_pattern, response)\n",
    "\n",
    "                # Extract and print the rating and explanation\n",
    "                llama_ratings.append(int(rating_match.group(1)) if rating_match else None)\n",
    "                llama_explanations.append(explanation_match.group(1) if explanation_match else response)\n",
    "        except Exception as e:\n",
    "                print(\"Exception\", e)\n",
    "                llama_ratings.append(None)\n",
    "                llama_explanations.append(response)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a peek at the results!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Number of judgements:\", len(llama_ratings))\n",
    "print(\"*************************************\")\n",
    "\n",
    "for i, d in enumerate(data[:len(llama_ratings)]):\n",
    "    print(\"Question:\", d[\"question\"])\n",
    "    print(\"Reference Answer:\", d[\"gt_answer\"])\n",
    "    print(\"Answer:\", d[\"answer\"])\n",
    "    print(\"Rating:\", llama_ratings[i])\n",
    "    print(\"Explanation:\", llama_explanations[i])\n",
    "    print(\"*************************************\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's calculate the mean Likert score and then display a historgram of all the ratings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# calculate mean\n",
    "llama_ratings = [1 if r == 0 else r for r in llama_ratings] # Change 0 ratings to 1\n",
    "llama_ratings_filtered = [r for r in llama_ratings if r ] # Remove empty ratings\n",
    "mean = round(statistics.mean(llama_ratings_filtered), 1)\n",
    "print(\"Number of ratings:\", len(llama_ratings_filtered))\n",
    "print(f\"Mean rating: {mean}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.ticker import MaxNLocator\n",
    "import seaborn as sns\n",
    "\n",
    "# Set the style of the visualization\n",
    "sns.set(style=\"white\")\n",
    "\n",
    "# Create a histogram\n",
    "plt.figure(figsize=(10, 6))\n",
    "ax = sns.histplot(llama_ratings_filtered, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5], kde=False)\n",
    "plt.xlim(.5, 5.5)\n",
    "plt.xticks([1, 2, 3, 4, 5])\n",
    "ax.yaxis.set_major_locator(MaxNLocator(integer=True))\n",
    "\n",
    "# Add titles and labels\n",
    "plt.title('Distribution of Ratings')\n",
    "plt.xlabel('Rating')\n",
    "plt.ylabel('Frequency')\n",
    "\n",
    "# Show the plot\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, let's write your evaluation results to a csv file so you can examine them in more detail later.\n",
    "\n",
    "Note: A few LLM Judge evaluation responses may be malformed and therefore unparseable. In these cases the rating and explanation fields will be empty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "results = list(zip(llama_ratings,\n",
    "                   llama_explanations,\n",
    "                   [d[\"question\"] for d in data],\n",
    "                   [d[\"answer\"] for d in data],\n",
    "                   [d[\"gt_answer\"] for d in data],\n",
    "                   [d[\"gt_context\"] for d in data]))\n",
    "\n",
    "output_file = 'judgements.csv'\n",
    "\n",
    "with open(output_file, 'w', newline='') as file:\n",
    "    writer = csv.writer(file)\n",
    "\n",
    "    # headers\n",
    "    writer.writerow(['Rating', 'Explanation', 'Question', 'Answer', 'Groundtruth Answer', 'Groundtruth Context'])\n",
    "\n",
    "    # Write the data\n",
    "    for row in results:\n",
    "        writer.writerow(row)\n",
    "\n",
    "print(f\"Data written to {output_file}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bonus! A good practice for improving a RAG pipeline is to look at the responses that were rated poorly and then determine actions to improve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[bad_result for bad_result in results if bad_result[0] == 1]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
